Introduction

MLB Team Success Trends Throughout Eras

Motivation

The purpose of this study was to perform statistical analysis on over 150 years of MLB single-season team statistical data in hopes of discovering historical success trends through eras. Runs (R) and runs allowed (RA) were used as the determinants for success.

This sort of analysis has been heavily used within the MLB over the past 20 years and is essential to franchise success.

From integration to juicing, baseball has changed repeatedly over the past 150 years. Thus, in order to discover historical trends, it was necessary to break up the data in accordance with the MLB’s nine eras:

  • Pre-1900 (1871-1900)
  • Deadball (1901-1920)
  • Liveball (1921-1942)
  • Integration (1943-1961)
  • Expansion (1962-1977)
  • Free Agent (1978-1994)
  • Steroid (1995-2004)
  • Contemporary (2005-2014)
  • Present Day (2015-)

Sources:

Sean Lahman Baseball Database

Baseball EDA

Time Series and Forecasting in R

Variable Index

The following explanatory variables were the focus of our analysis:

  • Hits (H)

  • Hits Allowed (HA)

  • On-base Percentage (OBP)

  • On-base Percentage plus Slugging Percentage (OPS)

  • Walks (BB)

  • Walks Allowed (BBA)

  • Strikeouts (SO)

  • Strikeouts by Pitchers (SOA)

  • Home Runs (HR)

  • Home Runs Allowed (HRA)

  • Fielding Percentage (FP)

  • Errors (E)

Era Division

The following interactive tables display all team statistics for each of the MLB’s nine eras.

Pre-1900

Deadball

Liveball

Integration

Expansion

Free Agent

Steroid

Contemporary

Present-Day

EDA

Runs by Year

Runs by Period

Explanation

The scatterplot to the left displays the median number of runs scored per team in each MLB season. The plot shows a very weak positive correlation between time and runs scored. One potential reason is that, although batting has improved considerably over time, the gains in offensive talent have been offset by steady improvement in pitching and fielding.

The boxplot of runs for each time period shows how scoring has changed throughout the history of baseball. The Pre-1900 Era has the widest range, likely because the league was not regulated and player skill levels varied wildly, which led to highly skilled teams beating up on teams with subpar players. The Dead Ball Era has the lowest median of any era, which aligns with the reasoning behind the era’s name. The Steroid Era has the highest median, which makes sense given the increase in home runs due to performance-enhancing drugs.

Correlation Exploration

The following correlograms use color to display the pairwise correlations within our offensive and defensive explanatory variable groups.

Pre-1900

Dead Ball

Live Ball

Integration

Expansion

Free Agent

Steroid

Contemporary

Present-Day

Results

Based on our correlograms of each era, we found hits and OPS to be the statistics most strongly associated with the number of runs scored, as one or both of the two had the highest correlation with runs. When comparing the Dead Ball and Live Ball Eras, we saw an increase in the correlation between OPS and runs, which is likely due to the introduction of power hitting to the sport. OPS’s correlation with runs decreased between the Contemporary and Present Day Eras, which makes sense when considering the developments of the past decade. The defensive shift and power-focused hitting approaches have grown in popularity, limiting the number of players getting on base compared to other eras. Also, more pitchers have extremely high spin rates, which makes pitches much harder to hit. Prior to the 2023 season, the MLB implemented new rules in an attempt to create more offense during games. The defensive shift is now banned, so hitters, especially left-handed ones, should get on base more often. Despite this decrease in OPS correlation, hits still hold the strongest correlation with runs in the Present Day Era.

Time Series

Runs Allowed Model


Call:
lm(formula = RA ~ HA + year, data = dfRA1)

Residuals:
     Min       1Q   Median       3Q      Max 
-116.669  -34.620   -4.987   33.762  231.750 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 681.66244  247.30590   2.756  0.00658 ** 
HA            0.45501    0.02505  18.168  < 2e-16 ***
year         -0.31543    0.13596  -2.320  0.02171 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.76 on 148 degrees of freedom
Multiple R-squared:  0.7426,    Adjusted R-squared:  0.7391 
F-statistic: 213.5 on 2 and 148 DF,  p-value: < 2.2e-16
    year   HA       RA
152 2023 1389 675.5523
153 2024 1389 675.2369
154 2025 1389 674.9214

Runs Model


Call:
lm(formula = R ~ OPS + H + year, data = dfR1)

Residuals:
    Min      1Q  Median      3Q     Max 
-86.434 -29.819  -8.594  20.242 151.631 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  793.94837  197.84714   4.013 9.52e-05 ***
OPS         1163.38858  123.85687   9.393  < 2e-16 ***
H              0.33081    0.02526  13.098  < 2e-16 ***
year          -0.70550    0.11390  -6.194 5.58e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 48.16 on 147 degrees of freedom
Multiple R-squared:  0.8445,    Adjusted R-squared:  0.8413 
F-statistic: 266.1 on 3 and 147 DF,  p-value: < 2.2e-16
    year    H        R   OPS
152 2023 1389 654.5484 0.712
153 2024 1389 653.8429 0.712
154 2025 1389 653.1374 0.712

Runs Trend

Findings

We noticed a decreasing trend in the number of runs being scored over the last decade. The Present Day Era has seen a drastic decrease in scoring compared to previous eras, so we fit linear regression models with a yearly time trend to the season medians to test whether this downward trend would continue. Our findings show a negative relationship between runs and year, as well as between runs allowed and year. For the runs allowed model, the year coefficient is -0.32: median runs allowed fall by an estimated 0.32 per year, controlling for hits allowed. This effect of year on runs allowed is statistically significant at the 5% significance level. The runs model has a year coefficient of -0.71: median runs fall by an estimated 0.71 per year, controlling for hits and OPS. This effect is significant at the 1% level, which is consistent with the impact better pitching and less-disciplined hitting have had on the game of baseball. The runs model accounted for 84% of the variation in median runs across seasons, and the runs allowed model accounted for 74% of the variation in median runs allowed. Based on these models, we predicted the median number of runs and runs allowed for each of the next three seasons (2023-2025) while keeping our predictor variables of hits, OPS, and hits allowed constant at their median values for the entire dataset; these predictions are displayed underneath the two linear regression models and reflect the negative trend found.

About the Authors

Our Background

Jesse:

My name is Jesse Devitt and I am an undergraduate student at the University of Dayton. Currently, my projected graduation is May 2024.

Right now, I am working towards completing a B.S. in Applied Mathematical Economics with a minor in Data Analytics.

I am interested in pursuing full time employment in the corporate data analytics field after my graduation.

This summer, I will be interning in the health and benefits division of Willis Towers Watson in downtown Chicago. Additionally, I am proficient in R, Stata, and Excel, and I am currently enrolled in a Python course at my university.

Feel free to connect with me on LinkedIn here.

Kevin:

My name is Kevin O’Connell and I am a sophomore Economics major from Chicago, Illinois.

I am minoring in Data Analytics and Sociology and am currently a Product Manager for Flyer Enterprises.

After college, I am interested in working in economic research, specifically studying social inequality.

Feel free to connect with me on LinkedIn here.

Project Limitations

Advanced statistics have become crucial to building teams in the modern era of baseball. Statistics such as WAR and wRC+ and projection systems such as ZiPS are used to evaluate and predict player performance, which a franchise can use to gain a competitive advantage in a given season. Our dataset did not contain the inputs needed to compute such statistics, so we focused on basic stats that are known to have strong positive or negative correlations with scoring. Also, we initially planned on using a team’s rank as our dependent variable, but the dataset’s team rank variable was based on divisional rank, not on whether a team won the championship. We decided to study runs scored and runs allowed because teams that score more runs and allow fewer are more likely to be the best teams in the league, and thus are more likely to win the championship.

Picture of Jesse

Jesse Devitt

Picture of Kevin

Kevin O'Connell

---
title: "MLB Historical Analysis"
author: Devitt/O'Connell
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: cosmo
      primary: "red"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

<style>
.chart-title {  /* chart_title  */
   font-size: 20px;
  }
body{  /* Normal  */
      font-size: 18px;
  }
</style>

```{r setup, include=FALSE}
library(flexdashboard)
library(shiny)
library(shinydashboard)
```

Introduction
===
<head>
    <base target = "_blank">
</head>

<font size=5>
**MLB Team Success Trends Throughout Eras**
</font>


Column {data-width=650}
-----------------------------------------------------------------------

### Motivation

The purpose of this study was to perform statistical analysis on over 150 years of MLB single-season team statistical data in hopes of discovering historical success trends through eras. Runs (R) and runs allowed (RA) were used as the determinants for success.

This sort of analysis has been heavily used within the MLB over the past 20 years and is essential to franchise success.

From integration to juicing, baseball has changed repeatedly over the past 150 years. Thus, in order to discover historical trends, it was necessary to break up the data in accordance with the MLB's nine eras:

- Pre-1900 (1871-1900)
- Deadball (1901-1920)
- Liveball (1921-1942)
- Integration (1943-1961)
- Expansion (1962-1977)
- Free Agent (1978-1994)
- Steroid (1995-2004)
- Contemporary (2005-2014)
- Present Day (2015-)

Sources: 

[Sean Lahman Baseball Database](https://www.seanlahman.com/baseball-archive/)

[Baseball EDA](https://norcalbiostat.github.io/EDA/analysis.html)

[Time Series and Forecasting in R](https://viz.datascience.arizona.edu/2021-time-series-intro/time-series-forecasting.html)


```{r}
knitr::opts_chunk$set(echo = TRUE)
library(pacman)
library(tidyverse)
library(plotly)
library(maps)
library(DT)
library(corrplot)
library(readxl)
library(esquisse)
library(RColorBrewer)
library(stats)
library(forecast)
library(tsibble)
teams <- read_excel("MTH 208 Project Data.xlsx")
# Assign each season to an era; any year outside the listed ranges falls into "PresentDay"
teams <- teams %>% mutate(period = case_when(
  yearID %in% 1871:1900 ~ "Pre1900",
  yearID %in% 1901:1920 ~ "DeadBall",
  yearID %in% 1921:1942 ~ "LiveBall",
  yearID %in% 1943:1961 ~ "Integration",
  yearID %in% 1962:1977 ~ "Expansion",
  yearID %in% 1978:1994 ~ "FreeAgent",
  yearID %in% 1995:2004 ~ "Steroid",
  yearID %in% 2005:2014 ~ "Contemporary",
  TRUE ~ "PresentDay"))
teams <- teams %>% select(-c(lgID, franchID, teamID, teamIDBR, teamIDlahman45, teamIDretro))
teams <- teams %>% select(yearID, name, Rank, everything())

# Round the rate statistics to three decimal places
teams$OBP <- round(teams$OBP, 3)
teams$SLG <- round(teams$SLG, 3)
teams$OPS <- round(teams$OPS, 3)

# **bold** to make bold
# - to make bullet points
```

Column {data-width=350}
-----------------------------------------------------------------------

### Variable Index

The following explanatory variables were the focus of our analysis:

- Hits (H)

- Hits Allowed (HA)

- On-base Percentage (OBP)

- On-base Percentage plus Slugging Percentage (OPS)

- Walks (BB)

- Walks Allowed (BBA)

- Strikeouts (SO)

- Strikeouts by Pitchers (SOA)

- Home Runs (HR)

- Home Runs Allowed (HRA)

- Fielding Percentage (FP)

- Errors (E)

Era Division
===

Column {.tabset data-width=550}
---

The following interactive tables display all team statistics for each of the MLB's nine eras.

### Pre-1900

```{r, echo=FALSE}
datatable(teams %>% filter(period=="Pre1900") %>% select(-period), options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))
```

### Deadball

```{r, echo=FALSE}
datatable(teams %>% filter(period=="DeadBall") %>% select(-period), options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))
```

### Liveball

```{r, echo=FALSE}
datatable(teams %>% filter(period=="LiveBall") %>% select(-period), options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))
```

### Integration

```{r, echo=FALSE}
datatable(teams %>% filter(period=="Integration") %>% select(-period), options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))
```

### Expansion

```{r, echo=FALSE}
datatable(teams %>% filter(period=="Expansion") %>% select(-period), options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))
```

### Free Agent

```{r, echo=FALSE}
datatable(teams %>% filter(period=="FreeAgent") %>% select(-period), options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))
```

### Steroid

```{r, echo=FALSE}
datatable(teams %>% filter(period=="Steroid") %>% select(-period), options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))
```

### Contemporary

```{r, echo=FALSE}
datatable(teams %>% filter(period=="Contemporary") %>% select(-period), options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))
```

### Present-Day

```{r, echo=FALSE}
datatable(teams %>% filter(period=="PresentDay") %>% select(-period), options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))
```


EDA
===

Column {.tabset data-width=650}
---

### Runs by Year

```{r, echo=FALSE}
# R is team runs scored (not runs batted in); plot the median across teams for each season
p1 <- ggplot(teams %>% group_by(yearID) %>% summarise(R = median(R)), aes(x = yearID, y = R)) + geom_point(size = 1, color = "darkred") + theme_classic() + labs(y = "Runs", x = "Year", title = "MLB Median Team Runs Scored per Year")
ggplotly(p1)
```

### Runs by Period

```{r, echo=FALSE}
teams$period <- factor(teams$period, levels = c( "Pre1900"  ,    "DeadBall"  ,   "LiveBall"  ,   "Integration" , "Expansion",   "FreeAgent"  ,  "Steroid"   ,   "Contemporary", "PresentDay"))
p2 <- ggplot(teams, aes(x = period, y = R)) + geom_boxplot(fill = "tan") + theme_classic() + labs(y = "Runs", x = "MLB Era", title = "Distribution of MLB Team Runs Scored by Era") +  theme(axis.text.x = element_text(angle = 25, hjust = 1))
ggplotly(p2)
```

Column {.tabset data-width=650}
---

### Explanation

The scatterplot to the left displays the median number of runs scored per team in each MLB season. The plot shows a very weak positive correlation between time and runs scored. One potential reason is that, although batting has improved considerably over time, the gains in offensive talent have been offset by steady improvement in pitching and fielding.

The boxplot of runs for each time period shows how scoring has changed throughout the history of baseball. The Pre-1900 Era has the widest range, likely because the league was not regulated and player skill levels varied wildly, which led to highly skilled teams beating up on teams with subpar players. The Dead Ball Era has the lowest median of any era, which aligns with the reasoning behind the era’s name. The Steroid Era has the highest median, which makes sense given the increase in home runs due to performance-enhancing drugs.
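
The claims about medians and ranges can also be checked numerically rather than read off the boxplot. The following is a minimal sketch, assuming a data frame with the same `period` and `R` columns as `teams`; the `toy` data frame, its era labels, and its values are made up for illustration and are not taken from the dataset.

```r
# Toy data (hypothetical values, not from the Lahman data)
toy <- data.frame(
  period = rep(c("EraA", "EraB", "EraC"), each = 4),
  R = c(300, 650, 900, 1100,  540, 560, 600, 620,  750, 780, 800, 830)
)

# Median and range (max - min) of runs within each era
era_summary <- do.call(rbind, lapply(split(toy$R, toy$period), function(r) {
  c(median = median(r), range = diff(range(r)))
}))
era_summary

# Era with the widest range, and era with the lowest median
widest_era <- rownames(era_summary)[which.max(era_summary[, "range"])]
lowest_median_era <- rownames(era_summary)[which.min(era_summary[, "median"])]
```

Swapping `toy` for `teams` should give the per-era figures that the boxplot summarizes visually.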




Correlation Exploration
===

The following correlograms use color to display the pairwise correlations within our offensive and defensive explanatory variable groups.

Column {.tabset data-width=600}
---

### Pre-1900

```{r, echo=FALSE}
par(mfrow = c(1, 2))
teams.Pre1900Runs<-teams %>% filter(period=="Pre1900") %>% select(c(R,OBP,OPS,BB,SO,H,HR)) %>% cor()
corrplot(teams.Pre1900Runs, method = c("color"),type="upper",main="Pre1900Runs",mar=c(0,0,1,0))
teams.Pre1900RunsAllowed<-teams %>% filter(period=="Pre1900") %>% select(c(RA,FP,SOA,BBA,HA,HRA,E)) %>% cor()
corrplot(teams.Pre1900RunsAllowed, method = c("color"),type="upper",main="Pre1900RunsAllowed",mar=c(0,0,1,0))
```

### Dead Ball

```{r, echo=FALSE}
par(mfrow = c(1, 2))
teams.DeadBallRuns<-teams %>% filter(period=="DeadBall") %>% select(c(R,OBP,OPS,BB,SO,H,HR)) %>% filter(complete.cases(.)) %>% cor()
corrplot(teams.DeadBallRuns, method = c("color"),type="upper",main="DeadBallRuns",mar=c(0,0,1,0))
teams.DeadBallRunsAllowed<-teams %>% filter(period=="DeadBall") %>% select(c(RA,FP,SOA,BBA,HA,HRA,E)) %>% cor()
corrplot(teams.DeadBallRunsAllowed, method = c("color"),type="upper",main="DeadBallRunsAllowed",mar=c(0,0,1,0))
```

### Live Ball

```{r, echo=FALSE}
par(mfrow = c(1, 2))
teams.LiveBallRuns<-teams %>% filter(period=="LiveBall") %>% select(c(R,OBP,OPS,BB,SO,H,HR)) %>% cor()
corrplot(teams.LiveBallRuns, method = c("color"),type="upper",main="LiveBallRuns",mar=c(0,0,1,0))
teams.LiveBallRunsAllowed<-teams %>% filter(period=="LiveBall") %>% select(c(RA,FP,SOA,BBA,HA,HRA,E)) %>% cor()
corrplot(teams.LiveBallRunsAllowed, method = c("color"),type="upper",main="LiveBallRunsAllowed",mar=c(0,0,1,0))
```



### Integration

```{r, echo=FALSE}
par(mfrow = c(1, 2))
teams.IntegrationRuns<-teams %>% filter(period=="Integration") %>% select(c(R,OBP,OPS,BB,SO,H,HR)) %>% cor()
corrplot(teams.IntegrationRuns, method = c("color"),type="upper",main="IntegrationRuns",mar=c(0,0,1,0))
teams.IntegrationRunsAllowed<-teams %>% filter(period=="Integration") %>% select(c(RA,FP,SOA,BBA,HA,HRA,E)) %>% cor()
corrplot(teams.IntegrationRunsAllowed, method = c("color"),type="upper",main="IntegrationRunsAllowed",mar=c(0,0,1,0))
```

### Expansion

```{r, echo=FALSE}
par(mfrow = c(1, 2))
teams.ExpansionRuns<-teams %>% filter(period=="Expansion") %>% select(c(R,OBP,OPS,BB,SO,H,HR)) %>% cor()
corrplot(teams.ExpansionRuns, method = c("color"),type="upper",main="ExpansionRuns",mar=c(0,0,1,0))
teams.ExpansionRunsAllowed<-teams %>% filter(period=="Expansion") %>% select(c(RA,FP,SOA,BBA,HA,HRA,E)) %>% cor()
corrplot(teams.ExpansionRunsAllowed, method = c("color"),type="upper",main="ExpansionRunsAllowed",mar=c(0,0,1,0))
```

### Free Agent

```{r, echo=FALSE}
par(mfrow = c(1, 2))
teams.FreeAgentRuns<-teams %>% filter(period=="FreeAgent") %>% select(c(R,OBP,OPS,BB,SO,H,HR)) %>% cor()
corrplot(teams.FreeAgentRuns, method = c("color"),type="upper",main="FreeAgentRuns",mar=c(0,0,1,0))
teams.FreeAgentRunsAllowed<-teams %>% filter(period=="FreeAgent") %>% select(c(RA,FP,SOA,BBA,HA,HRA,E)) %>% cor()
corrplot(teams.FreeAgentRunsAllowed, method = c("color"),type="upper",main="FreeAgentRunsAllowed",mar=c(0,0,1,0))
```


### Steroid

```{r, echo=FALSE}
par(mfrow = c(1, 2))
teams.SteroidRuns<-teams %>% filter(period=="Steroid") %>% select(c(R,OBP,OPS,BB,SO,H,HR)) %>% cor()
corrplot(teams.SteroidRuns, method = c("color"),type="upper",main="SteroidRuns",mar=c(0,0,1,0))
teams.SteroidRunsAllowed<-teams %>% filter(period=="Steroid") %>% select(c(RA,FP,SOA,BBA,HA,HRA,E)) %>% cor()
corrplot(teams.SteroidRunsAllowed, method = c("color"),type="upper",main="SteroidRunsAllowed",mar=c(0,0,1,0))
```

### Contemporary

```{r, echo=FALSE}
par(mfrow = c(1, 2))
teams.ContemporaryRuns<-teams %>% filter(period=="Contemporary") %>% select(c(R,OBP,OPS,BB,SO,H,HR)) %>% cor()
corrplot(teams.ContemporaryRuns, method = c("color"),type="upper",main="ContemporaryRuns",mar=c(0,0,1,0))
teams.ContemporaryRunsAllowed<-teams %>% filter(period=="Contemporary") %>% select(c(RA,FP,SOA,BBA,HA,HRA,E)) %>% cor()
corrplot(teams.ContemporaryRunsAllowed, method = c("color"),type="upper",main="ContemporaryRunsAllowed",mar=c(0,0,1,0))
```

### Present-Day

```{r, echo=FALSE}
par(mfrow = c(1, 2))
teams.PresentDayRuns<-teams %>% filter(period=="PresentDay") %>% select(c(R,OBP,OPS,BB,SO,H,HR)) %>% cor()
corrplot(teams.PresentDayRuns, method = c("color"),type="upper",main="PresentDayRuns",mar=c(0,0,1,0))
teams.PresentDayRunsAllowed<-teams %>% filter(period=="PresentDay") %>% select(c(RA,FP,SOA,BBA,HA,HRA,E)) %>% cor()
corrplot(teams.PresentDayRunsAllowed, method = c("color"),type="upper",main="PresentDayRunsAllowed",mar=c(0,0,1,0))
```


Column {data-width=400}
---

### Results

Based on our correlograms of each era, we found hits and OPS to be the statistics most strongly associated with the number of runs scored, as one or both of the two had the highest correlation with runs. When comparing the Dead Ball and Live Ball Eras, we saw an increase in the correlation between OPS and runs, which is likely due to the introduction of power hitting to the sport. OPS’s correlation with runs decreased between the Contemporary and Present Day Eras, which makes sense when considering the developments of the past decade. The defensive shift and power-focused hitting approaches have grown in popularity, limiting the number of players getting on base compared to other eras. Also, more pitchers have extremely high spin rates, which makes pitches much harder to hit. Prior to the 2023 season, the MLB implemented new rules in an attempt to create more offense during games. The defensive shift is now banned, so hitters, especially left-handed ones, should get on base more often. Despite this decrease in OPS correlation, hits still hold the strongest correlation with runs in the Present Day Era.
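
Reading the strongest correlate of runs off nine color-coded plots can be error-prone, so it may help to tabulate the correlations with `R` directly. The sketch below is one possible way to do that in base R, not the method used for the dashboard. The `toy` data frame is simulated (hypothetical era labels and values), and only `H` and `OPS` are included to keep it short.

```r
# Simulated data standing in for `teams` (hypothetical values, not from the dataset)
set.seed(1)
toy <- data.frame(
  period = rep(c("EraA", "EraB"), each = 30),
  H = round(rnorm(60, mean = 1400, sd = 60)),
  OPS = round(rnorm(60, mean = 0.72, sd = 0.03), 3)
)
toy$R <- round(0.3 * toy$H + 1100 * toy$OPS - 500 + rnorm(60, sd = 20))

# For each era, correlate every explanatory variable with runs
cor_with_runs <- sapply(split(toy, toy$period), function(d) {
  cor(d[, c("H", "OPS")], d$R, use = "complete.obs")[, 1]
})

# Rows are variables, columns are eras
round(cor_with_runs, 2)
```

With the real data, the column selection would be extended to the full offensive group (`OBP`, `OPS`, `BB`, `SO`, `H`, `HR`) and `toy` replaced by `teams`.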

Era Trends
===

Column {.tabset data-width=650}
---

### Runs Allowed

```{r, echo=FALSE}
p3 <- ggplot(teams, aes(x =yearID, y = RA)) + geom_point(size = 1, aes(x =yearID, y = RA, col=period, text = paste0(name, ":\n", RA, " runs allowed in ", yearID))) + geom_smooth(se=FALSE) + ggtitle("Runs Allowed over Time") + xlab("Year") + ylab("Runs Allowed") + scale_color_brewer(palette = "Set3") + theme_classic()
ggplotly(p3,tooltip = "text")
```

### Hits Allowed

```{r, echo=FALSE}
p4 <- ggplot(teams, aes(x =yearID, y = HA)) + geom_point(size = 1, aes(x =yearID, y = HA, col=period, text = paste0(name, ":\n", HA, " hits allowed in ", yearID))) + geom_smooth(se=FALSE) + ggtitle("Hits Allowed over Time") + xlab("Year") + ylab("Hits Allowed") + scale_color_brewer(palette = "Set3") + theme_classic()
ggplotly(p4, tooltip = "text")
```

### Runs

```{r, echo=FALSE}
p5 <- ggplot(teams, aes(x =yearID, y = R)) + geom_point(size = 1, aes(x =yearID, y = R, col=period, text = paste0(name, ":\n", R, " runs earned in ", yearID))) + geom_smooth(se=FALSE) + ggtitle("Team Runs Earned over Time") + xlab("Year") + ylab("Runs") + scale_color_brewer(palette = "Set3") + theme_classic()
ggplotly(p5, tooltip = "text")
```

### Hits
```{r, echo=FALSE}
p6 <- ggplot(teams, aes(x =yearID, y = H)) + geom_point(size = 1, aes(x =yearID, y = H, col=period, text = paste0(name, ":\n", H, " hits earned in ", yearID))) + geom_smooth(se=FALSE) + ggtitle("Team Hits over Time") + xlab("Year") + ylab("Hits") + scale_color_brewer(palette = "Set3") + theme_classic()
ggplotly(p6, tooltip = "text")
```

### OPS
```{r, echo=FALSE}
p7 <- ggplot(teams, aes(x =yearID, y = OPS)) + geom_point(size = 1, aes(x =yearID, y = OPS, col=period, text = paste0(name, ":\n", " team OPS of ", OPS, " in ", yearID))) + geom_smooth(se=FALSE) + ggtitle("Team OPS over Time") + xlab("Year") + ylab("OPS") + scale_color_brewer(palette = "Set3") + theme_classic()
ggplotly(p7, tooltip = "text")
```

Column {.tabset data-width=500}
---

### Explanation

The scatterplots of runs allowed and hits allowed have similar trend lines, with peaks in the Live Ball Era and at the transition between the Steroid and Contemporary Eras. This similarity makes sense: the more hits a team gives up, the more scoring opportunities the opposition has, so the team will likely allow more runs.

Additionally, hits have a strong positive correlation with runs, much like hits allowed with runs allowed. The correlation between runs scored and OPS is also worth noting. Baseball philosophy has recently become focused on more efficient ways of scoring, which has made the OPS (On-Base Plus Slugging Percentage) statistic relevant. OPS measures how well a team or player gets on base and hits for power. Because slugging percentage weights each hit by the number of bases it earns, home runs and other extra-base hits count more heavily than singles and walks, so a team with a higher OPS typically gets on base at a higher rate or hits more home runs, doubles, and triples than a team with a lower OPS. Hits peak in the Live Ball Era and at the transition between the Steroid and Contemporary Eras, and OPS has risen steadily overall, as the 2022 average team OPS is higher than all seasons before the beginning of the Steroid Era.
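
To make the weighting concrete, here is a small worked example using the standard definitions of on-base percentage, slugging percentage, and OPS. The season totals are hypothetical, and whether the dataset's `OBP`, `SLG`, and `OPS` columns were computed in exactly this way is an assumption.

```r
# Hypothetical season totals for one team (illustrative, not from the dataset)
team <- list(AB = 5500, H = 1400, X2B = 280, X3B = 25, HR = 180,
             BB = 520, HBP = 60, SF = 45)

# Total bases: singles count once, doubles twice, triples three times, home runs four times
singles <- team$H - team$X2B - team$X3B - team$HR
total_bases <- singles + 2 * team$X2B + 3 * team$X3B + 4 * team$HR

# On-base percentage: times on base divided by plate appearances that count toward OBP
obp <- (team$H + team$BB + team$HBP) / (team$AB + team$BB + team$HBP + team$SF)

# Slugging percentage: total bases per at-bat
slg <- total_bases / team$AB

# OPS is the sum of the two
ops <- obp + slg
round(c(OBP = obp, SLG = slg, OPS = ops), 3)
```

A walk raises only the OBP term, while a home run raises OBP and adds four bases to SLG, which is why power hitting moves OPS more than walks or singles do.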

### Historical Context

The Pre-1900 Era was the birth of baseball, which caused a wide range of skill levels to be present on major league teams, hence the spread of runs. In the Dead Ball Era, the baseball was not wound tightly, which limited players’ ability to hit for power and distance. The Live Ball Era saw a major increase in runs because the ball was wound tighter and rules were enacted to ban manipulation of the ball. These rules opened the door for the quick change of the game from “small ball” (singles, bunts, stolen bases) to home run hitting. The Integration Era saw minority players allowed into the MLB as well as the near collapse of the league during World War II. Many players joined the military, and attendance declined along with the quality of play. The Expansion and Free Agent Eras brought on national growth and player autonomy, but the league also had to compete with the newly popular NFL and NBA. After the 1994-95 strike, the MLB was in dire straits. Then, PEDs took over the game, with steroid users demolishing home run records. This offensive explosion revived the MLB, but in the mid-2000s officials cracked down on steroid use, which started the decline of scoring. Now, hits are harder to come by, as higher velocities and spin rates of pitches combine with the defensive shift to limit opportunities for run scoring.




Time Series
===

Column {.tabset data-width=500}
---

### Runs Allowed Model

```{r, echo=FALSE}
# Season-level medians of runs allowed and hits allowed; 2020 (a 60-game season) is excluded
dfRA <- teams %>% group_by(yearID) %>% summarise(RA = median(RA), HA = median(HA)) %>% filter(yearID != 2020)

dfRA1 <- data.frame(
  year = dfRA$yearID,
  HA = dfRA$HA,
  RA = dfRA$RA)
ts_RA <- as_tsibble(dfRA1, index = year)
# Linear model with a yearly time trend, controlling for hits allowed
modelRA <- lm(RA ~ HA + year, data = dfRA1)
# Append 2023-2025 with hits allowed held constant at 1389
new_dataRA <- data.frame(year = c(dfRA1$year, 2023, 2024, 2025), HA = c(dfRA1$HA, rep(1389, 3)))
predictionsRA <- predict(modelRA, newdata = new_dataRA)
RAhat <- data.frame(
  year = new_dataRA$year,
  HA = new_dataRA$HA,
  RA = predictionsRA
)
ts_RAhat <- as_tsibble(RAhat, index = year)
# Model summary, followed by the three forecast rows
summary(modelRA)
tail(RAhat, 3)
```

### Runs Model

```{r, echo=FALSE}
# Season-level medians of runs, hits, and OPS; 2020 (a 60-game season) is excluded
dfR <- teams %>% group_by(yearID) %>% summarise(R = median(R), H = median(H), OPS = median(OPS)) %>% filter(yearID != 2020)
dfR1 <- data.frame(
  year = dfR$yearID,
  H = dfR$H,
  R = dfR$R,
  OPS = dfR$OPS)
ts_R <- as_tsibble(dfR1, index = year)
# Linear model with a yearly time trend, controlling for OPS and hits
modelR <- lm(R ~ OPS + H + year, data = dfR1)
# Append 2023-2025 with hits held constant at 1389 and OPS at 0.712
new_dataR <- data.frame(year = c(dfR1$year, 2023, 2024, 2025), H = c(dfR1$H, rep(1389, 3)), OPS = c(dfR1$OPS, rep(0.712, 3)))
predictionsR <- predict(modelR, newdata = new_dataR)
Rhat <- data.frame(
  year = new_dataR$year,
  H = new_dataR$H,
  R = predictionsR,
  OPS = new_dataR$OPS
)
ts_Rhat <- as_tsibble(Rhat, index = year)
# Model summary, followed by the three forecast rows
summary(modelR)
tail(Rhat, 3)
```

Column {.tabset data-width=500}
---

### Runs Trend

```{r, echo=FALSE}
ggplot(ts_RA, aes(x = year, y = RA, color = HA)) + geom_line() + theme_classic() + labs(x = "Year", y = "Runs Allowed", title = "Median Runs Allowed By MLB Season", subtitle = "Time Series", color = "Hits Allowed")
```

### Findings

We noticed a decreasing trend in the number of runs being scored over the last decade. The Present Day Era has seen a drastic decrease in scoring compared to previous eras, so we fit linear regression models with a yearly time trend to the season medians to test whether this downward trend would continue. Our findings show a negative relationship between runs and year, as well as between runs allowed and year. For the runs allowed model, the year coefficient is -0.32: median runs allowed fall by an estimated 0.32 per year, controlling for hits allowed. This effect of year on runs allowed is statistically significant at the 5% significance level. The runs model has a year coefficient of -0.71: median runs fall by an estimated 0.71 per year, controlling for hits and OPS. This effect is significant at the 1% level, which is consistent with the impact better pitching and less-disciplined hitting have had on the game of baseball. The runs model accounted for 84% of the variation in median runs across seasons, and the runs allowed model accounted for 74% of the variation in median runs allowed. Based on these models, we predicted the median number of runs and runs allowed for each of the next three seasons (2023-2025) while keeping our predictor variables of hits, OPS, and hits allowed constant at their median values for the entire dataset; these predictions are displayed underneath the two linear regression models and reflect the negative trend found.
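
The forecasts can be reproduced by hand from the fitted coefficients, which makes clear that they fall each year only because the `year` coefficient is negative while the other predictors are held fixed. The sketch below copies the coefficients from the printed model summaries; because those printed values are rounded, the results match the `predict()` output only to about two decimal places. The object names are illustrative.

```r
# Coefficients copied from the printed summaries (rounded as displayed)
coef_RA <- c(intercept = 681.66244, HA = 0.45501, year = -0.31543)
coef_R  <- c(intercept = 793.94837, OPS = 1163.38858, H = 0.33081, year = -0.70550)

years <- 2023:2025

# Runs allowed: hits allowed held constant at 1389
RA_hat <- coef_RA[["intercept"]] + coef_RA[["HA"]] * 1389 + coef_RA[["year"]] * years

# Runs: OPS held constant at 0.712 and hits at 1389
R_hat <- coef_R[["intercept"]] + coef_R[["OPS"]] * 0.712 +
  coef_R[["H"]] * 1389 + coef_R[["year"]] * years

data.frame(year = years, RA_hat = round(RA_hat, 1), R_hat = round(R_hat, 1))
```

Each additional year lowers the predicted median by the size of the `year` coefficient (about 0.32 runs allowed and 0.71 runs), so the three forecasts restate the fitted trend rather than provide independent evidence for it.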


About the Authors
===

Column {.tabset data-width=500}
---

### Our Background

Jesse:

My name is Jesse Devitt and I am an undergraduate student at the University
of Dayton. Currently, my projected graduation is May 2024.

Right now, I am working towards completing a B.S. in Applied Mathematical Economics with a minor in Data Analytics.

I am interested in pursuing full time employment in the corporate data analytics field after my graduation.

This summer, I will be interning in the health and benefits division of Willis Towers Watson in downtown Chicago. Additionally, I am proficient in R, Stata, and Excel, and I am currently enrolled in a Python course at my university.

Feel free to connect with me on LinkedIn [here](https://www.linkedin.com/in/jesse-devitt-0a2305228/).


Kevin:

My name is Kevin O’Connell and I am a sophomore Economics major from Chicago, Illinois.

I am minoring in Data Analytics and Sociology and am currently a Product Manager for Flyer Enterprises.

After college, I am interested in working in economic research, specifically studying social inequality.

Feel free to connect with me on LinkedIn
[here](https://www.linkedin.com/in/kevino-connell/). 

### Project Limitations

Advanced statistics have become crucial to building teams in the modern era of baseball. Statistics such as WAR and wRC+ and projection systems such as ZiPS are used to evaluate and predict player performance, which a franchise can use to gain a competitive advantage in a given season. Our dataset did not contain the inputs needed to compute such statistics, so we focused on basic stats that are known to have strong positive or negative correlations with scoring. Also, we initially planned on using a team’s rank as our dependent variable, but the dataset’s team rank variable was based on divisional rank, not on whether a team won the championship. We decided to study runs scored and runs allowed because teams that score more runs and allow fewer are more likely to be the best teams in the league, and thus are more likely to win the championship.

Column {.tabset data-width=500}
---

### Picture of Jesse

```{r , fig.width=6, echo=FALSE, fig.cap="Jesse Devitt", fig.align='right'}
knitr::include_graphics("MTH208.jpg")
```

### Picture of Kevin

```{r , fig.width=6, echo=FALSE, fig.cap="Kevin O'Connell", fig.align='right'}
knitr::include_graphics("Kevin O’Connell.JPG")
```